Accuracy and Scaling Phenomena in Internet Mapping

A great deal of effort has been spent measuring topological features of the Internet. However, it
was recently argued that sampling based on taking paths or traceroutes through the network from
a small number of sources introduces a fundamental bias in the observed degree distribution. We
examine this bias analytically and experimentally. For Erdős-Rényi random graphs with mean degree
$c$, we show analytically that traceroute sampling gives an observed degree distribution
$P(k) \sim k^{-1}$ for $k < c$, even though the underlying degree distribution is Poisson. For
graphs whose degree distributions have power-law tails $P(k) \sim k^{-\alpha}$, traceroute sampling
from a small number of sources can significantly underestimate the value of $\alpha$ when the graph
has a large excess (i.e., many more edges than vertices). We find that in order to obtain a good
estimate of $\alpha$ it is necessary to use a number of sources which grows linearly in the average
degree of the underlying graph. Based on these observations we comment on the accuracy of the
published values of $\alpha$ for the Internet.



The Internet is a canonical complex network, and a great deal of effort has been spent
measuring its topology. However, unlike the Web, where a page's outgoing links are directly
visible, we cannot typically ask a router who its neighbors are. As a result, studies have
sought to infer the topology of the Internet by aggregating paths or traceroutes through the
network, typically from a small number of sources to a large number of destinations [?],
routing decisions like those embedded in Border Gateway Protocol (BGP) routing tables [?],
or both [?]. Although such methods are known to be noisy [?], they strongly suggest that
the Internet has a power-law degree distribution at both the router and domain levels.

However, Lakhina et al. [15] recently argued that traceroute-based sampling introduces a
fundamental bias in topological inferences, since the probability that an edge appears
within an efficient route decreases with its distance from the source. They showed
empirically that traceroutes from a single source cause Erdős-Rényi random graphs $G(n,p)$,
whose underlying degree distribution is Poisson, to appear to have a power-law degree
distribution $P(k) \sim k^{-1}$. Here, we prove this evocative result analytically by
modeling the growth of a spanning tree on $G(n,p)$ using differential equations.

Although it is widely accepted that the Internet, unlike $G(n,p)$, has a power-law degree
distribution $P(k) \sim k^{-\alpha}$ with $2 < \alpha < 3$ [?], we may reasonably ask whether
traceroute sampling accurately estimates the exponent $\alpha$. Petermann and de los Rios [?]
and Dall'Asta et al. [?] considered this question, and found that because low-degree
vertices are undersampled relative to high-degree ones, the observed value of $\alpha$ is
lower than the true exponent of the underlying graph. We explore this idea further, and find
that single-source traceroute sampling only gives a good estimate of $\alpha$ when the
underlying graph has a small excess, i.e., has average degree close to 2 and is close to a
tree. As the average degree grows, so does the extent to which traceroute sampling
underestimates $\alpha$.



Since single-source traceroutes can significantly underestimate $\alpha$, we then turn to
the question of how many sources are required to obtain an accurate estimate of $\alpha$.
We find that the number of sources needed increases linearly with the average degree. We
conclude with some discussion of whether the published values of $\alpha$ for the Internet
are accurate, and how to tell experimentally whether more sources are needed.

Traceroute spanning trees: analytical results. The set of traceroutes from a single source
can be modeled as a spanning tree. If we assume that Internet routing protocols approximate
shortest paths, this spanning tree is built breadth-first from the source. In fact, the
results of this section apply to spanning trees built in a variety of ways, as we will see
below.

We can think of the spanning tree as built step-by-step 
by an algorithm that explores the graph. At each step, 
every vertex in the graph is labeled reached, pending, or 
unknown. Pending vertices are the leaves of the current 
tree; reached vertices are interior vertices; and unknown 
vertices are those not yet connected. We initialize the 
process by labeling the source vertex pending, and all 
other vertices unknown. Then the growth of the spanning 
tree is given by the following pseudocode: 

    while there are pending vertices:
        choose a pending vertex v
        label v reached
        for every unknown neighbor u of v:
            label u pending

The type of spanning tree is determined by how we choose the pending vertex v. Storing
vertices in a queue and taking them in FIFO (first-in, first-out) order gives a breadth-first
tree of shortest paths; if we like we can break ties randomly between vertices of the same
age in the queue, which is equivalent to adding a small noise term to the length of each
edge as in [15]. Storing pending vertices on a stack and taking them in LIFO (last-in,
first-out) order builds a depth-first tree. Finally, choosing v uniformly at random from the
pending vertices gives a "random-first" tree.

FIG. 1: Sampled degree distributions from breadth-first, depth-first and random-first
spanning trees on a random graph of size $n = 10^5$ and average degree $c = 100$, and our
analytic results (black dots). For comparison, the black line shows the Poisson degree
distribution of the underlying graph. Note the power-law behavior of the apparent degree
distribution $P(k) \sim k^{-1}$, which extends up to a cutoff at $k \sim c$.



Surprisingly, while these three processes build different trees, and traverse them in
different orders, they all yield the same degree distribution when $n$ is large. To
illustrate this, Fig. 1 shows the degree distributions for each type of spanning tree for a
random graph $G(n, p = c/n)$ where $n = 10^5$ and $c = 100$. The three degree distributions
are indistinguishable, and all agree with the analytic results derived below.
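The exploration process above can be tried at small scale. The following is a minimal Python sketch, not the code used for Fig. 1: the function names `gnp_adjacency` and `tree_degrees`, the O(n^2) graph generator, and the small values of n and c in the demo are all assumptions made for illustration. It grows the tree with the three selection rules and tallies the resulting tree degrees.

```python
import random
from collections import Counter, deque

UNKNOWN, PENDING, REACHED = 0, 1, 2

def gnp_adjacency(n, p, rng):
    """Adjacency lists of an Erdos-Renyi graph G(n, p); O(n^2), fine for small n."""
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj

def tree_degrees(adj, source, order, rng):
    """Grow a spanning tree from `source` and return {vertex: degree in the tree}.

    order = "bfs" takes the oldest pending vertex (FIFO), "dfs" the youngest
    (LIFO), and anything else a uniformly random pending vertex.
    """
    label = [UNKNOWN] * len(adj)
    label[source] = PENDING
    degree = {source: 0}
    pending = deque([source])
    while pending:
        if order == "bfs":
            v = pending.popleft()
        elif order == "dfs":
            v = pending.pop()
        else:
            i = rng.randrange(len(pending))
            pending[i], pending[-1] = pending[-1], pending[i]  # swap, then pop
            v = pending.pop()
        label[v] = REACHED
        for u in adj[v]:
            if label[u] == UNKNOWN:
                label[u] = PENDING
                degree[u] = 1       # the edge by which u became attached
                degree[v] += 1      # v gains one tree neighbor
                pending.append(u)
    return degree

if __name__ == "__main__":
    rng = random.Random(0)
    n, c = 2000, 20                 # illustrative sizes, far smaller than in Fig. 1
    adj = gnp_adjacency(n, c / n, rng)
    for order in ("bfs", "dfs", "random"):
        hist = Counter(tree_degrees(adj, 0, order, rng).values())
        print(order, [hist[k] for k in range(1, 8)])
```

At these sizes the three histograms should look similar but noisy; the statement in the text concerns the limit of large n.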

We now show analytically that building spanning trees in Erdős-Rényi random graphs
$G(n, p = c/n)$ using any of the processes described above gives rise to an apparent
power-law degree distribution $P(k) \sim k^{-1}$ for $k < c$. To model the progress of the
while loop described above, let $S(T)$ and $U(T)$ denote the number of pending and unknown
vertices at step $T$, respectively. The expected changes in these variables at each step are

$$
\begin{aligned}
E[U(T+1) - U(T)] &= -p\,U(T) \\
E[S(T+1) - S(T)] &= p\,U(T) - 1
\end{aligned}
\tag{1}
$$

Here the $p\,U(T)$ terms come from the fact that a given unknown vertex u is connected to the
chosen pending vertex v with probability $p$, in which case we change its label from unknown
to pending; the $-1$ term comes from the fact that we also change v's label from pending to
reached. Moreover, these equations apply no matter how we choose v: whether v is the
"oldest" vertex (breadth-first), the "youngest" one (depth-first), or a random one
(random-first). Since edges in $G(n,p)$ are independent, the events that v is connected to
each unknown vertex u are independent and occur with probability $p$.



Writing $t = T/n$, $s(t) = S(tn)/n$ and $u(t) = U(tn)/n$, the difference equations (1)
become the following system of differential equations,

$$
\frac{du}{dt} = -c\,u, \qquad \frac{ds}{dt} = c\,u - 1. \tag{2}
$$

With the initial conditions $u(0) = 1$ and $s(0) = 0$, the solution to (2) is

$$
u(t) = e^{-ct}, \qquad s(t) = 1 - t - e^{-ct}. \tag{3}
$$
The algorithm ends at the smallest positive root $t_0$ of $s(t) = 0$; using Lambert's
function $W$, defined as $W(x) = y$ where $y e^{y} = x$, we can write

$$
t_0 = 1 + \frac{1}{c}\, W\!\left(-c\,e^{-c}\right). \tag{4}
$$

Note that $t_0$ is the fraction of vertices which are reached at the end of the process, and
this is simply the size of the giant component of $G(n, c/n)$.
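As a numerical check of this step, the sketch below iterates the expected-value recurrences for $U$ and $S$ directly and compares the stopping time with the root of $s(t) = 1 - t - e^{-ct}$. It is only an illustration: the helper names `end_time` and `iterate_expected` are hypothetical, the root is found by bisection rather than through Lambert's function, and the recurrences are iterated for their means (no randomness), which assumes $c > 1$.

```python
import math

def end_time(c, tol=1e-12):
    """Smallest positive root t0 of s(t) = 1 - t - exp(-c t), for c > 1, by bisection."""
    def s(t):
        return 1.0 - t - math.exp(-c * t)
    lo, hi = math.log(c) / c, 1.0    # s is maximal at ln(c)/c and negative at t = 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if s(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def iterate_expected(n, c):
    """Iterate the expected-value recurrences with p = c/n until S reaches 0.

    Returns (number of steps taken, final expected number of unknown vertices).
    """
    p = c / n
    U, S, T = n - 1.0, 1.0, 0        # source pending, all other vertices unknown
    while S > 0.0:
        newly_pending = p * U
        U -= newly_pending
        S += newly_pending - 1.0
        T += 1
    return T, U

if __name__ == "__main__":
    n, c = 100_000, 5.0              # illustrative values
    t0 = end_time(c)
    T, U = iterate_expected(n, c)
    print(f"t0 from s(t0) = 0    : {t0:.5f}")
    print(f"T/n from recurrences : {T / n:.5f}")
    print(f"unknown fraction U/n : {U / n:.5f}  vs  1 - t0 = {1 - t0:.5f}")
```

The two estimates of the stopping time should agree up to corrections of order $1/n$, and the unknown fraction left at the end equals $1 - t_0$, the fraction of vertices outside the giant component.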

Now, we wish to calculate the degree distribution $P(k)$ of this tree. The degree of each
vertex v is the number of its previously unknown neighbors, plus one for the edge by which
it became attached (except for the root). Now, if v is chosen at time $t$, in the limit
$n \to \infty$ the probability it has $k$ unknown neighbors is given by the Poisson
distribution with mean $m = c\,u(t)$, $\mathrm{Poisson}(m, k) = e^{-m} m^{k} / k!$. Averaging
over all the vertices in the tree and ignoring $o(1)$ terms gives

$$
P(k+1) = \frac{1}{t_0} \int_0^{t_0} dt\, \mathrm{Poisson}(c\,u(t), k) .
$$

It is helpful to change the variable of integration to $m$. Since $m = c\,e^{-ct}$ we have
$dm = -c\,m\,dt$, and

$$
\begin{aligned}
P(k+1) &= \frac{1}{t_0} \int_{c(1-t_0)}^{c} dm\, \frac{\mathrm{Poisson}(m, k)}{c\,m} \\
       &\approx \frac{1}{c\,k!} \int_{c\,e^{-c}}^{c} dm\, e^{-m} m^{k-1} .
\end{aligned}
\tag{5}
$$

Here in the second line we use the fact that $t_0 \approx 1 - e^{-c}$ when $c$ is large
(i.e., the giant component encompasses almost all of the graph).

The integral in (5) is given by the difference between two incomplete Gamma functions.
However, since the integrand is peaked at $m = k - 1$ and falls off exponentially for larger
$m$, for $k < c$ it coincides almost exactly with the full Gamma function $\Gamma(k)$.
Specifically, for any $c > 0$ we have

$$
\int_0^{c\,e^{-c}} dm\, e^{-m} m^{k-1} < c\,e^{-c}
$$

and, if $k - 1 = c(1 - \epsilon)$ for $\epsilon > 0$, then

$$
\begin{aligned}
\int_c^{\infty} dm\, e^{-m} m^{k-1}
  &= e^{-c} c^{k-1} \int_0^{\infty} dx\, e^{-x} (1 + x/c)^{k-1} \\
  &< e^{-c} c^{k-1} \int_0^{\infty} dx\, e^{-x} e^{x(k-1)/c}
   = \frac{e^{-c} c^{k-1}}{\epsilon} \\
  &< \frac{\Gamma(k)}{\epsilon \sqrt{2\pi (k-1)}} .
\end{aligned}
$$

This is $o(\Gamma(k))$ if $\epsilon > 1/\sqrt{k}$, i.e., if $k < c - c^{a}$ for some
$a > 1/2$. In that case we have

$$
P(k+1) = (1 - o(1))\, \frac{\Gamma(k)}{c\,k!} = (1 - o(1))\, \frac{1}{c\,k} , \tag{6}
$$

giving a power law $k^{-1}$ up to $k \sim c$.
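The $1/(ck)$ behavior can be checked numerically by evaluating the time integral for $P(k+1)$ directly. The sketch below is one way to do so under stated assumptions: it uses $u(t) = e^{-ct}$, a plain midpoint rule, and a fixed-point iteration for $t_0$ that assumes $c$ is comfortably larger than 1; the function names are hypothetical.

```python
import math

def poisson_from_log_mean(log_m, k):
    """Poisson probability e^{-m} m^k / k! with m = exp(log_m), computed in log space."""
    return math.exp(-math.exp(log_m) + k * log_m - math.lgamma(k + 1))

def tree_degree_prob(degree, c, steps=20000):
    """P(degree) = (1/t0) * integral_0^t0 Poisson(c e^{-ct}, degree - 1) dt (midpoint rule)."""
    k = degree - 1                   # unknown neighbors at the moment v is chosen
    t0 = 1.0
    for _ in range(200):             # fixed-point iteration for 1 - t0 = exp(-c t0)
        t0 = 1.0 - math.exp(-c * t0)
    dt = t0 / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        total += poisson_from_log_mean(math.log(c) - c * t, k)   # log m = log c - c t
    return total * dt / t0

if __name__ == "__main__":
    c = 100.0
    for k in (2, 5, 10, 20, 50, 150):
        p = tree_degree_prob(k + 1, c)
        print(f"k = {k:3d}: P(k+1) = {p:.3e}   1/(ck) = {1.0 / (c * k):.3e}")
```

For $k$ well below $c$ the two columns should agree closely, while for $k$ above $c$ the integral falls far below $1/(ck)$, which is the cutoff described in the text.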

We note that this derivation can be made mathematically rigorous, at least for constant $c$.
Wormald [?] showed, under fairly generic conditions, that discrete stochastic processes like
this one are well-modeled by the corresponding differential equations. Specifically, we can
show that if the initial source vertex is in the giant component, then with high
probability, for all $t$ such that $0 < t < t_0$, $U(tn) = u(t)\,n + o(n)$ and
$S(tn) = s(t)\,n + o(n)$. It follows that with high probability our calculations give the
correct degree distribution of the spanning tree within $o(1)$.

Power-law degree distributions. While the result of the previous section shows that
power-law degree distributions can be observed even when none exist, the evidence seems
overwhelming that the Internet does, in fact, have a power-law degree distribution
$P(k) \sim k^{-\alpha}$. However, as shown in [?], traceroute sampling on graphs of this kind
can underestimate the value of $\alpha$ by undersampling the low-degree vertices relative to
the high-degree ones. Here we show experimentally that the extent of this underestimate
increases with the average degree of the underlying graph. We performed experiments on both
the preferential attachment model of Barabási and Albert [21] and the configuration model
[22].

The preferential attachment model of [21] gives each new vertex $m$ edges, and so has
minimum degree $m$ and average degree $2m$. In the extreme case $m = 1$, the graph is a tree,
and traceroutes from a single source will sample every edge. However, as $m$ increases the
fraction of edges sampled by a given source decreases. Figure 2 shows the observed and
underlying degree distributions for different values of $m$. For $m = 2$, for instance, the
observed slope is $\alpha_{\mathrm{obs}} \approx 2.7$ instead of the correct value
$\alpha = 3$.
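The dependence of the sampled fraction on $m$ is easy to see in a small experiment. The following Python sketch is an illustration only: the generator `preferential_attachment` uses a star-shaped seed and a simple endpoint-list trick, both of which are assumptions here rather than details taken from the experiments, and single-source traceroute sampling is modeled as one breadth-first tree with no tie-breaking noise.

```python
import random
from collections import deque

def preferential_attachment(n, m, rng):
    """Edge list of a preferential attachment graph on n vertices.

    Seeded with a star (vertex m joined to 0..m-1); each later vertex attaches to
    m distinct earlier vertices chosen with probability proportional to degree.
    """
    edges = [(i, m) for i in range(m)]
    endpoints = [v for edge in edges for v in edge]   # one entry per incident edge
    for v in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        for u in chosen:
            edges.append((u, v))
            endpoints.extend((u, v))
    return edges

def bfs_tree(n, edges, source):
    """Edges of a breadth-first (shortest-path) tree grown from `source`."""
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    seen = [False] * n
    seen[source] = True
    queue, tree = deque([source]), []
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if not seen[u]:
                seen[u] = True
                tree.append((v, u))
                queue.append(u)
    return tree

if __name__ == "__main__":
    rng = random.Random(0)
    n = 5000                          # illustrative size
    for m in (1, 2, 20):
        edges = preferential_attachment(n, m, rng)
        tree = bfs_tree(n, edges, source=rng.randrange(n))
        print(f"m = {m:2d}: {len(tree)} of {len(edges)} edges observed "
              f"({len(tree) / len(edges):.1%})")
```

Because a single-source tree on a connected graph always has $n - 1$ edges while the graph has roughly $nm$, the observed fraction is close to $1/m$ in this sketch. Estimating the observed exponent would require a fit to the tail of the tree's degree histogram, which is not attempted here.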

It is worth pointing out that the average degree, and therefore $\alpha_{\mathrm{obs}}$, is
highly sensitive to the low-degree part of the degree distribution, not just the shape of
its high-degree tail. For instance, we used the configuration model [22] to construct random
graphs with minimum degree $k_{\min}$ and a power-law tail, i.e., $P(k) = 0$ for
$k < k_{\min}$ and $P(k) \propto k^{-\alpha}$ for $k \geq k_{\min}$. (Note that the
normalization of $P(k)$ then depends on $k_{\min}$.) Here we found that
$\alpha_{\mathrm{obs}}$ is a function of $k_{\min}$, not just of $\alpha$ [23]. We are
currently extending our analytic calculations to this and other degree distributions.



[Figure: underlying and observed degree distributions for m = 2, m = 20 and m = 200;
horizontal axis: degree]

FIG. 2: Single-source traceroute sampling for preferential attachment networks with
$n = 5 \times 10^{?}$ and varying values of the minimum degree $m$. The extent to which
traceroute sampling underestimates $\alpha$ increases with $m$.



Building unbiased maps. Since single-source traceroutes can significantly underestimate
$\alpha$, especially for graphs of large average degree, we now turn to the question of how
many sources are needed to obtain a good estimate of $\alpha$. In Fig. 3 we show the
observed exponent (estimated by performing a fit to the high-degree tail $k \gg m$) for
preferential attachment networks as a function of the number of sources divided by $m$; the
figure also shows the fraction of edges included in the sample. The collapse of the data
clearly shows that the number of sources $s$ we need to converge to within a given error
from the true exponent grows linearly in $m$, and the error decreases rapidly as $s/m$
increases. For instance, with $m$ sources we see 41% of the edges and
$\alpha_{\mathrm{obs}} \approx 2.82$; with $10m$ sources, 5 times the average degree, we see
94% of the edges and our estimate improves to $\alpha_{\mathrm{obs}} \approx 2.99$.
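The edge-coverage part of this experiment can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the procedure behind Fig. 3: the graph generator (star seed, endpoint list), the crude random tie-breaking obtained by shuffling neighbor lists, and the names `pa_adjacency` and `observed_edges` are all introduced here for the example, and no exponent is fitted.

```python
import random
from collections import deque

def pa_adjacency(n, m, rng):
    """Adjacency lists of a preferential attachment graph (star seed on m+1 vertices)."""
    adj = [[] for _ in range(n)]
    endpoints = []                    # one entry per edge endpoint, for degree-biased picks

    def add_edge(a, b):
        adj[a].append(b)
        adj[b].append(a)
        endpoints.extend((a, b))

    for i in range(m):
        add_edge(i, m)
    for v in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        for u in chosen:
            add_edge(u, v)
    return adj

def observed_edges(adj, sources, rng):
    """Union of the edges of one breadth-first tree per source."""
    observed = set()
    for s in sources:
        seen = {s}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            neighbors = adj[v][:]
            rng.shuffle(neighbors)    # crude random tie-breaking between shortest paths
            for u in neighbors:
                if u not in seen:
                    seen.add(u)
                    observed.add((min(u, v), max(u, v)))
                    queue.append(u)
    return observed

if __name__ == "__main__":
    rng = random.Random(0)
    n, m = 3000, 3                    # illustrative values
    adj = pa_adjacency(n, m, rng)
    total = sum(len(a) for a in adj) // 2
    sources = rng.sample(range(n), 10 * m)
    for s in (1, m, 10 * m):
        frac = len(observed_edges(adj, sources[:s], rng)) / total
        print(f"{s:3d} sources (s/m = {s / m:.2f}): {frac:.1%} of edges observed")
```

Repeating the run for several values of $m$ and plotting the observed fraction against $s/m$ is one way to look for the collapse described above; the percentages printed by this sketch need not match the figures quoted in the text.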

Traceroute-based studies [?] suggest an average degree for the Internet of $2.8 \pm 0.5$.
(Of course, it may be higher since these studies do not see all the edges of the graph.)
However, none of these studies use more than 12 sources, suggesting that the published
values of $\alpha$ may still be somewhat low.

For the Internet, gaining access to an increasing number of sources in order to sample
traceroutes from them can present practical difficulties. However, even if the measured
exponent increases with each additional source (indicating that we still do not have the
correct value of $\alpha$, and that the "marginal value" of each source is nonzero), it may
be possible to extrapolate the true $\alpha$ from the rate of convergence.

FIG. 3: Performance of multi-source traceroute sampling in preferential attachment networks
as a function of the number of sources divided by $m$. On the left, the convergence of
$\alpha_{\mathrm{obs}}$ to the correct value $\alpha = 3$; on the right, the fraction of
edges observed at least once. Both curves collapse, showing that the number of sources
necessary to counter the sampling bias grows linearly with the average degree.



Conclusions. Unlike the World Wide Web where links 
are visible, the Internet's topology must be queried in- 
directly, e.g., by traceroutes; and, since efficient routing 
protocols cause these traceroutes to approximate short- 
est paths, edges far from the source are difficult to see. 



Lakhina et al. [15] noted that this effect can significantly bias the observed degree
distribution, and may create the appearance of a power law where none exists. We have proved
this result analytically for random graphs $G(n, p = c/n)$, showing that single-source
traceroutes yield an observed distribution $P(k) \sim k^{-1}$ for $k < c$. Other mechanisms
for observing power laws in $G(n,p)$ include gradient-based flows [24], probabilistic
pruning [17], and minimum weight spanning trees [25]; however, these are rather different
from our analysis.

For graphs with a power-law distribution $P(k) \sim k^{-\alpha}$, traceroute sampling
underestimates $\alpha$ by undersampling low-degree vertices [?], and we have found that the
extent of this underestimate increases with the network's average degree. To compensate for
this effect, we have found that to estimate $\alpha$ within a given error it is necessary to
use a number of sources that grows linearly with the average degree. Given the small number
of sources used in existing studies, it seems possible to us that the published values of
$\alpha$ for the Internet are somewhat low. In future work, we will measure whether
$\alpha_{\mathrm{obs}}$ for the Internet increases with the number of sources, and if it
does, attempt to extrapolate the correct value of $\alpha$.